
Collaborator

@sfluegel05 sfluegel05 commented Oct 14, 2025

This PR adds new model and dataset features for upgrading our ensemble. This includes:

Models

  • Logistic Regression (although it does not perform very well)
  • functioning LSTMs (better than LR, but not as good as Electra)
  • an option to freeze the Electra weights (this also has a negative impact on performance)

Datasets

  • subset-based training (only 3_STAR or only 2_STAR molecules; note that training on one subset affects performance when the model is tested on a different subset)
  • id / label filters - these come in handy when comparing datasets. I have used them to compare models trained on 3-star data with models trained on the full ChEBI (e.g., I took ChEBI50 with a label filter from ChEBI50-3STAR to evaluate a 3-star model)
  • PubChem batched dataset - this allows splitting a PubChem dataset into batches instead of handling it as one file. Training uses one batch per epoch. I used this to pretrain a model on the whole PubChem dataset (118 million SMILES, 1 million per batch)
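
To illustrate the label filter and the one-batch-per-epoch scheme described above, here is a minimal sketch in plain Python. It is not the code from this PR: the function names, the file names, and the wrap-around once all batches have been used are assumptions made for illustration only.

```python
from typing import List, Sequence


def batch_for_epoch(epoch: int, batch_files: Sequence[str]) -> str:
    """Return the single batch file to train on in the given epoch.

    Assumption: after the last batch, the schedule wraps around to the first.
    """
    if not batch_files:
        raise ValueError("no batch files given")
    return batch_files[epoch % len(batch_files)]


def label_filter_indices(labels: Sequence[str], keep: Sequence[str]) -> List[int]:
    """Positions in `labels` whose label also occurs in `keep`.

    `labels` would be the label set of the dataset being evaluated
    (e.g. ChEBI50), `keep` the label set of the filter dataset
    (e.g. ChEBI50-3STAR). The order of `labels` is preserved.
    """
    keep_set = set(keep)
    return [i for i, label in enumerate(labels) if label in keep_set]


def apply_label_filter(target: Sequence[int], indices: Sequence[int]) -> List[int]:
    """Reduce one multi-label target vector to the filtered label positions."""
    return [target[i] for i in indices]


if __name__ == "__main__":
    # Hypothetical batch file names, one file per batch of SMILES.
    files = [f"pubchem_batch_{i}.txt" for i in range(3)]
    for epoch in range(4):
        print(epoch, batch_for_epoch(epoch, files))

    # Hypothetical label names for the full dataset and the filter dataset.
    full_labels = ["CHEBI:1", "CHEBI:2", "CHEBI:3"]
    filter_labels = ["CHEBI:3", "CHEBI:1"]
    idx = label_filter_indices(full_labels, filter_labels)
    print(apply_label_filter([1, 0, 1], idx))
```

With a filter like this, a model trained on the smaller label set can be evaluated on samples from the larger dataset, since targets and predictions then cover the same labels.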

@sfluegel05 sfluegel05 marked this pull request as ready for review November 5, 2025 12:13
@sfluegel05 sfluegel05 merged commit 0925a41 into dev Nov 5, 2025
7 checks passed
@sfluegel05 sfluegel05 deleted the feature/new-ensemble-models branch November 5, 2025 12:13
@sfluegel05 sfluegel05 linked an issue Nov 17, 2025 that may be closed by this pull request

Development

Successfully merging this pull request may close these issues.

Build a 3-star ChEBI dataset

2 participants